Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

نویسندگان

  • Edwin Dauber
  • Aylin Caliskan
  • Richard E. Harang
  • Rachel Greenstadt
چکیده

Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that complete files that are individually authored can be attributed, we show here for the first time that accounts belonging to open source contributors containing short, incomplete, and typically uncompilable fragments can also be effectively attributed. We propose a technique for authorship attribution of contributor accounts containing small source code samples, such as those that can be obtained from version control systems or other direct comparison of sequential versions. We show that while application of previous methods to individual small source code samples yields an accuracy of about 73% for 106 programmers as a baseline, by ensembling and averaging the classification probabilities of a sufficiently large set of samples belonging to the same author we achieve 99% accuracy for assigning the set of samples to the correct author. Through these results, we demonstrate that attribution is an important threat to privacy for programmers even in real-world collaborative environments such as GitHub. Additionally, we propose the use of calibration curves to identify samples by unknown and previously unencountered authors in the open world setting. We show that we can also use these calibration curves in the case that we do not have linking information and thus are forced to classify individual samples directly. This is because the calibration curves allow us to identify which samples are more likely to have been correctly attributed. Using such a curve can help an analyst choose a cut-off point which will prevent most misclassifications, at the cost of causing the rejection of some of the more dubious correct attributions.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Who Wrote This Code? Identifying the Authors of Program Binaries

Program authorship attribution—identifying a programmer based on stylistic characteristics of code—has practical implications for detecting software theft, digital forensics, and malware analysis. Authorship attribution is challenging in these domains where usually only binary code is available; existing source code-based approaches to attribution have left unclear whether and to what extent pr...

متن کامل

On the Feasibility of Malware Authorship Attribution

There are many occasions in which the security community is interested to discover the authorship of malware binaries, either for digital forensics analysis of malware corpora or for thwarting live threats of malware invasion. Such a discovery of authorship might be possible due to stylistic features inherent to software codes written by human programmers. Existing studies of authorship attribu...

متن کامل

Recognizing and Imitating Programmer Style: Adversaries in Program Authorship Attribution

Source code attribution classifiers have recently become powerful. We consider the possibility that an adversary could craft code with the intention of causing a misclassification, i.e., creating a forgery of another author’s programming style in order to hide the forger’s own identity or blame the other author. We find that it is possible for a non-expert adversary to defeat such a system. In ...

متن کامل

Exploiting Stylistic Idiosyncrasies for Authorship Attribution

Introduction Early researchers in authorship attribution used a variety of statistical methods to identify stylistic discriminators – characteristics which remain approximately invariant within the works of a given author but which tend to vary from author to author (Holmes 1998, McEnery & Oakes 2000). In recent years machine learning methods have been applied to authorship attribution. A few e...

متن کامل

Attribution of Contested and Anonymous Ancient Greek Works

Authorship attribution has been a persistent problem in the Classical genre, as texts that reach us from antiquity are often corrupted, edited, or forged over the thousands of years since their initial production. Scholars have worked on identifying writers’ stylistic differences in an attempt to distinguish genuine texts from fakes, and to attribute an author to previously anonymous works. Inc...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1701.05681  شماره 

صفحات  -

تاریخ انتشار 2017